AITopics | convergence property

Collaborating Authors

convergence property

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Huang, Feihu, Luo, Yuning, Chen, Songcan

arXiv.org Machine LearningMay-20-2026

Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer is designed for matrix parameters of large-scale models, and shows markedly faster convergence than the vector-wise algorithms. Although some works have begun to study convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) is still not established. Thus, in this paper, we study generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that the Muon has a generalization error of $O\big(\frac{1}{Nκ^{T}}\big)$, where $N$ is training sample size, and $T$ denotes iteration number, and $κ>0$ denotes minimum difference between singular values of gradient estimate. To enhance generalization of the Muon, we propose an effective mixed Muon (MiMuon) optimizer by cautiously using orthogonalization of gradient, which is a hybrid of Muon and momentum-based SGD optimizers. Then we prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than $O\big(\frac{1}{Nκ^{T}}\big)$ of Muon optimizer, since $κ$ generally is very small. Meanwhile, we also studied the convergence properties of our MiMuon algorithm, and prove that our MiMuon algorithm has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Some numerical experimental results on training large models including Qwen3-0.6B and YOLO26m demonstrate efficiency of the MiMuon optimizer.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2605.19619

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Add feedback

Deep Insights into Noisy Pseudo Labeling on Graph Data

Neural Information Processing SystemsApr-30-2026, 06:24:49 GMT

Pseudo labeling (PL) is a wide-applied strategy to enlarge the labeled dataset by self-annotating the potential samples during the training process. Several works have shown that it can improve the graph learning model performance in general. However, we notice that the incorrect labels can be fatal to the graph training process. Inappropriate PL may result in the performance degrading, especially on graph data where the noise can propagate. Surprisingly, the corresponding error is seldom theoretically analyzed in the literature.

artificial intelligence, machine learning, prediction, (16 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization

Huang, Feihu, Zhang, Guanyi, Chen, Songcan

arXiv.org Machine LearningApr-17-2026

Lion optimizer is a popular learning-based optimization algorithm in machine learning, which shows impressive performance in training many deep learning models. Although convergence property of the Lion optimizer has been studied, its generalization analysis is still missing. To fill this gap, we study generalization property of the Lion via algorithmic stability based on the mathematical induction. Specifically, we prove that the Lion has a generalization error of $O(\frac{1}{Nτ^T})$, where $N$ is training sample size, and $τ>0$ denotes the smallest absolute value of non-zero element in gradient estimator, and $T$ is the total iteration number. In addition, we obtain an interesting byproduct that the SignSGD algorithm has the same generalization error as the Lion. To enhance generalization of the Lion, we design a novel efficient Cautious Lion (i.e., CLion) optimizer by cautiously using sign function. Moreover, we prove that our CLion has a lower generalization error of $O(\frac{1}{N})$ than $O(\frac{1}{Nτ^T})$ of the Lion, since the parameter $τ$ generally is very small. Meanwhile, we study convergence property of our CLion optimizer, and prove that our CLion has a fast convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ under $\ell_1$-norm of gradient for nonconvex stochastic optimization, where $d$ denotes the model dimension. Extensive numerical experiments demonstrate effectiveness of our CLion optimizer.

artificial intelligence, machine learning, optimization problem, (17 more...)

arXiv.org Machine Learning

2604.14587

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Fundamental Convergence Analysis of Sharpness-Aware Minimization

Neural Information Processing SystemsMar-18-2026, 12:50:13 GMT

The paper investigates the fundamental convergence properties of Sharpness-Aware Minimization (SAM), a recently proposed gradient-based optimization method (Foret et al., 2021) that significantly improves the generalization of deep neural networks. The convergence properties including the stationarity of accumulation points, the convergence of the sequence of gradients to the origin, the sequence of function values to the optimal value, and the sequence of iterates to the optimal solution are established for the method. The universality of the provided convergence analysis based on inexact gradient descent frameworks (Khanh et al., 2023b) allows its extensions to the normalized versions of SAM such as F-SAM (Li et al. 2024), VaSSO (Li & Giannakis, 2023), RSAM (Liu et al., 2022), and to the unnormalized versions of SAM such as USAM (Andriushchenko & Flammarion, 2022). Numerical experiments are conducted on classification tasks using deep learning models to confirm the practical aspects of our analysis.

deep learning, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)

Add feedback

Stochastic Gradient MCMC with Stale Gradients

Neural Information Processing SystemsMar-17-2026, 10:33:40 GMT

Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers.

artificial intelligence, gradient, machine learning, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Deep Insights into Noisy Pseudo Labeling on Graph Data Botao Wang

Neural Information Processing SystemsFeb-17-2026, 21:40:53 GMT

Pseudo labeling (PL) is a wide-applied strategy to enlarge the labeled dataset by self-annotating the potential samples during the training process.

machine learning, natural language, prediction, (17 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

7ffb4e0ece07869880d51662a2234143-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-12-2026, 17:47:43 GMT

reviewer, softmax value, theorem, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.32)

Add feedback

6bdc60ac98a8eda10922d35eced6f969-Paper-Conference.pdf

Neural Information Processing SystemsFeb-9-2026, 15:14:59 GMT

algorithm, arxiv preprint arxiv, sppm, (10 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology (0.68)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.94)
Information Technology > Game Theory (0.94)
(3 more...)

Add feedback

1959eb9d5a0f7ebc58ebde81d5df400d-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-7-2026, 15:45:54 GMT

consistency, neural network, selection, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

Neural Information Processing SystemsDec-25-2025, 02:32:39 GMT

Natural gradient descent has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, especially for \emph{non-linear} networks. In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. We identify two conditions which guarantee the global convergence: (1) the Jacobian matrix (of network's output for all training cases w.r.t the parameters) is full row rank and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e. with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized. We further extend our analysis to more general loss function with similar convergence property. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.

fast convergence, natural gradient descent, over-parameterized neural network, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)

Add feedback